Efficiencies for Working with IPEDS Peer Data

Alex McClung

Data Resource Analyst

April 3, 2023

Outline

⚡ Automating & documenting custom IPEDS Data Center downloads

⚡ Working with multiple years of IPEDS “Complete” data files

Motivation

Peer data needed for 14 Yale school profiles

IPEDS Data Center Downloads

Browser automation with Playwright Test for Python 🎭

Playwright Test for Python 🎭

Save, modify, and run the codegen recorded code

from playwright.sync_api import Playwright, sync_playwright, expect

def run(playwright: Playwright) -> None:
    browser = playwright.chromium.launch(headless=False, slow_mo=50)
    ### RECORD VIDEO
    context = browser.new_context(record_video_dir="videos/")
    page = context.new_page()
    page.goto("https://nces.ed.gov/ipeds/use-the-data")
    page.get_by_role("link", name="Compare Institutions", exact=True).click()
    ### ADD/CHANGE PEER SETS
    page.locator("#tbInstitutionSearch").fill("217156,190150,190415,182670,166027,166683,186131,243744,144050,215062,130794")
    page.get_by_role("button", name="Select").click()
    page.get_by_text("Check All", exact=True).click()
    page.get_by_role("button", name="Continue").click()
    page.get_by_role("link", name="Continue").click()
    page.get_by_text("Admissions and Test Scores").click()
    page.get_by_text("Admissions and test scores", exact=True).click()
    page.get_by_text("Number of applications, admissions, and enrollees").click()
    ### ADD/CHANGE YEARS
    page.get_by_label("2021-22").check()
    page.get_by_label("Applicants total").check()
    page.get_by_label("Admissions total").check()
    page.get_by_label("Enrolled total").check()
    ### SAVE SCREENSHOTS
    page.screenshot(path="admissions-screenshot.png")
    page.get_by_role("img", name="Continue").click()
    page.get_by_role("img", name="Continue").click()
    with page.expect_download() as download_info:
        page.get_by_role("button", name="Continue").click()
    download = download_info.value
    ### EXPORT DATA AS CSV
    download.save_as("ivyplus-admissions.csv")

    # ---------------------
    context.close()
    browser.close()

with sync_playwright() as playwright:
    run(playwright)

IPEDS Complete Data Files

Working with multiple years of data with Arrow

  • Read multiple CSVs at once (or, as one)

  • Query data without reading it into memory

  • Add year of the filename to the data set

Arrow Example in R

Reading multiple CSVs with arrow::open_dataset()

library(arrow)
## EXAMPLE IPEDS DATA FILES (~60 Mb each)  
dir('data')
[1] "c2017_a.csv" "c2018_a.csv" "c2019_a.csv" "c2020_a.csv" "c2021_a.csv"
tic()
ds <- open_dataset('data', format = 'csv')
toc()
0.16 sec elapsed
## WHAT KIND OF OBJECT? OBJECT SIZE? 
class(ds); print(object.size(ds), units = "auto")
[1] "FileSystemDataset" "Dataset"           "ArrowObject"      
[4] "R6"               
504 bytes

Arrow Example in R

Query data without reading it into memory

ds <- open_dataset('data', format = 'csv') %>%
  ## YOUR QUERY HERE  
  filter(UNITID == 130794) %>% 
  group_by(AWLEVEL) %>% 
  summarize(Degrees = sum(CTOTALT, na.rm=TRUE))

## WHAT KIND OF OBJECT? OBJECT SIZE? 
class(ds); print(object.size(ds), units = "auto")
[1] "arrow_dplyr_query"
8.2 Kb

Arrow Example in R

Get year with add_filename() and collect() data into memory

open_dataset('data', format = 'csv') %>% 
  filter(UNITID == 130794, MAJORNUM == 1, CIPCODE == 99) %>% 
  mutate(file_name = add_filename(),
         Year = gsub('.*c([0-9]+)_a.csv', '\\1', file_name)) %>% 
  group_by(Year) %>% 
  summarise(Degrees = sum(CTOTALT, na.rm=TRUE)) %>% 
  collect() ## INTO MEMORY
# A tibble: 5 × 2
  Year  Degrees
  <chr>   <int>
1 2017     4345
2 2018     4494
3 2019     4789
4 2020     4863
5 2021     4121

What else can you do with Arrow?

You do not need to use it with multiple files

tic()
csv_df <- read_csv_arrow('data/c2021_a.csv', as_data_frame = FALSE)
toc()
0.55 sec elapsed

parquet data file format is “the new CSV” 😎

tic()
parquet_df <- read_parquet('parquet/c2021_a.parquet')
toc()
0.44 sec elapsed
fs::file_size('parquet/c2021_a.parquet')
5.99M

Thank You!